Latte: Latent Diffusion Transformer for Video Generation (2401.03048v1)

Published 5 Jan 2024 in cs.CV

Abstract: We propose a novel Latent Diffusion Transformer, namely Latte, for video generation. Latte first extracts spatio-temporal tokens from input videos and then adopts a series of Transformer blocks to model video distribution in the latent space. In order to model a substantial number of tokens extracted from videos, four efficient variants are introduced from the perspective of decomposing the spatial and temporal dimensions of input videos. To improve the quality of generated videos, we determine the best practices of Latte through rigorous experimental analysis, including video clip patch embedding, model variants, timestep-class information injection, temporal positional embedding, and learning strategies. Our comprehensive evaluation demonstrates that Latte achieves state-of-the-art performance across four standard video generation datasets, i.e., FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. In addition, we extend Latte to the text-to-video generation (T2V) task, where it achieves results comparable to recent T2V models. We strongly believe that Latte provides valuable insights for future research on incorporating Transformers into diffusion models for video generation.

Introduction to Latte: A Novel Approach to Video Generation

In the field of AI-driven video generation, building models capable of creating high-quality videos remains an intricate challenge, largely due to the high-dimensional and complex nature of video content. Yet recent advances in diffusion models, originally developed for image generation, suggest a new frontier for video generation. Building on these advances, a new approach has been introduced: a Latent Diffusion Transformer model named Latte. Latte leverages the capabilities of Transformer blocks to model spatial and temporal video information within a latent space.

Core Principles of Latte

Latte first employs a variational autoencoder to encode input videos into a latent space. It then extracts spatio-temporal tokens from the encoded features and applies a series of Transformer blocks, an architecture well established for capturing long-range dependencies, to these tokens. Because characterizing video content requires a very large number of tokens, Latte introduces four efficient model variants, each decomposing the spatial and temporal dimensions of the input in a different way, so that the massive token volume can be handled without compromising efficiency (see the sketch below).
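
To make the decomposition concrete, here is a minimal sketch, not the paper's exact implementation, of one interleaved spatial-temporal Transformer block of the kind such variants build on: spatial attention mixes patch tokens within each frame, while temporal attention mixes tokens across frames at each spatial location. The module name, layer ordering, and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FactorizedSTBlock(nn.Module):
    """Sketch of one interleaved spatial/temporal Transformer block.

    Spatial attention mixes the s patch tokens within each frame;
    temporal attention mixes the t frame tokens at each spatial
    location. Layer order and sizes are illustrative, not the
    paper's verified configuration.
    """

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, t frames, s spatial patches, dim)
        b, t, s, d = x.shape
        # Spatial attention: fold the frame axis into the batch axis,
        # so attention only sees tokens belonging to one frame.
        xs = x.reshape(b * t, s, d)
        h = self.norm1(xs)
        xs = xs + self.spatial_attn(h, h, h)[0]
        # Temporal attention: fold the spatial axis into the batch axis,
        # so attention only sees one spatial location across all frames.
        xt = xs.reshape(b, t, s, d).permute(0, 2, 1, 3).reshape(b * s, t, d)
        h = self.norm2(xt)
        xt = xt + self.temporal_attn(h, h, h)[0]
        xt = xt + self.mlp(self.norm3(xt))
        return xt.reshape(b, s, t, d).permute(0, 2, 1, 3)
```

Factorizing attention this way reduces the cost per block from attending over all t×s tokens jointly to two much smaller attention operations, which is what makes the token volume tractable.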

Refining Video Generation with Latte

A comprehensive exploration of Transformer-based latent diffusion models for video generation has led to several key findings. Through methodical ablation analysis, best practices have been identified for video clip patch embedding, timestep-class information injection, temporal positional embedding, and learning strategies. By integrating these best practices, Latte generates photorealistic videos with temporally coherent content, surpassing other methods across various video generation benchmarks.
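
As one illustration of timestep-class information injection, the sketch below implements adaptive layer normalization, in which a conditioning embedding is mapped to per-channel scale and shift applied to the normalized tokens. This is a plausible reading of the kind of conditioning mechanism such ablations compare (e.g., against concatenating the condition as an extra token), not the paper's verified implementation; the class name and MLP shape are assumptions.

```python
import torch
import torch.nn as nn

class AdaLNConditioning(nn.Module):
    """Sketch of timestep/class injection via adaptive layer norm.

    A small MLP maps the combined timestep (and optionally class)
    embedding to a per-channel scale and shift that modulate the
    normalized token features. Shapes and naming are assumptions.
    """

    def __init__(self, dim: int):
        super().__init__()
        # Affine parameters come from the condition, not the norm itself.
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Sequential(nn.SiLU(), nn.Linear(dim, 2 * dim))

    def forward(self, tokens: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, n, dim); cond: (batch, dim) timestep/class embedding
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(tokens) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```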

Assessment and Application

Latte has been rigorously evaluated on multiple video generation datasets, demonstrating superior performance as measured by Inception Score (IS), Fréchet Video Distance (FVD), and Fréchet Inception Distance (FID). Beyond generating videos from latent representations, Latte also shows promising capabilities on text-to-video (T2V) generation tasks: benchmarked against existing T2V models, it delivers competitive results, indicating its versatility across diverse video generation applications.
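
Both FID and FVD reduce to a Fréchet distance between Gaussians fitted to features of real and generated samples (Inception image features for FID, I3D video features for FVD). The helper below is a sketch of that shared core; it assumes the feature means and covariances have already been computed, and omits feature extraction entirely.

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1: np.ndarray, sigma1: np.ndarray,
                     mu2: np.ndarray, sigma2: np.ndarray) -> float:
    """Fréchet distance between two Gaussians N(mu1, sigma1), N(mu2, sigma2).

    mu*: feature means, shape (d,); sigma*: covariances, shape (d, d).
    """
    diff = mu1 - mu2
    # Matrix square root of sigma1 @ sigma2; tiny imaginary parts from
    # numerical error are discarded.
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Lower FVD/FID indicates generated samples whose feature statistics are closer to those of real data, which is why these metrics dominate video generation benchmarks.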

In conclusion, the proposed Latent Diffusion Transformer model, Latte, stands as a significant advancement in video generation, thanks to its strategic use of Transformer architecture in diffusion models. With its state-of-the-art performance and versatile application to T2V tasks, it provides valuable insights and opens new avenues for further research in this rapidly evolving field. The full project, including the data supporting the findings of this paper, is available to the public, encouraging collaboration and innovation within the community.

arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Chen R, Chen Y, Jiao N, et al (2023b) Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. In: International Conference on Computer Vision Chen et al [2023c] Chen X, Wang Y, Zhang L, et al (2023c) Seine: Short-to-long video diffusion model for generative transition and prediction. arXiv preprint arXiv:231020700 Deng et al [2009] Deng J, Dong W, Socher R, et al (2009) Imagenet: A large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, IEEE, pp 248–255 Devlin et al [2019] Devlin J, Chang MW, Lee K, et al (2019) Bert: Pre-training of deep bidirectional transformers for language understanding. In: North American Chapter of the Association for Computational Linguistics : Human Language Technologies Dhariwal and Nichol [2021] Dhariwal P, Nichol A (2021) Diffusion models beat gans on image synthesis. Neural Information Processing Systems 34:8780–8794 Dosovitskiy et al [2021] Dosovitskiy A, Beyer L, Kolesnikov A, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations Ge et al [2022] Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. 
In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. 
In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Chen X, Wang Y, Zhang L, et al (2023c) Seine: Short-to-long video diffusion model for generative transition and prediction. 
arXiv preprint arXiv:231020700 Deng et al [2009] Deng J, Dong W, Socher R, et al (2009) Imagenet: A large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, IEEE, pp 248–255 Devlin et al [2019] Devlin J, Chang MW, Lee K, et al (2019) Bert: Pre-training of deep bidirectional transformers for language understanding. In: North American Chapter of the Association for Computational Linguistics : Human Language Technologies Dhariwal and Nichol [2021] Dhariwal P, Nichol A (2021) Diffusion models beat gans on image synthesis. Neural Information Processing Systems 34:8780–8794 Dosovitskiy et al [2021] Dosovitskiy A, Beyer L, Kolesnikov A, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations Ge et al [2022] Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. 
In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. 
In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Deng J, Dong W, Socher R, et al (2009) Imagenet: A large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, IEEE, pp 248–255 Devlin et al [2019] Devlin J, Chang MW, Lee K, et al (2019) Bert: Pre-training of deep bidirectional transformers for language understanding. In: North American Chapter of the Association for Computational Linguistics : Human Language Technologies Dhariwal and Nichol [2021] Dhariwal P, Nichol A (2021) Diffusion models beat gans on image synthesis. Neural Information Processing Systems 34:8780–8794 Dosovitskiy et al [2021] Dosovitskiy A, Beyer L, Kolesnikov A, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations Ge et al [2022] Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. 
arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. 
In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. 
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Devlin J, Chang MW, Lee K, et al (2019) Bert: Pre-training of deep bidirectional transformers for language understanding. In: North American Chapter of the Association for Computational Linguistics : Human Language Technologies Dhariwal and Nichol [2021] Dhariwal P, Nichol A (2021) Diffusion models beat gans on image synthesis. Neural Information Processing Systems 34:8780–8794 Dosovitskiy et al [2021] Dosovitskiy A, Beyer L, Kolesnikov A, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations Ge et al [2022] Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. 
IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. 
arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Dhariwal P, Nichol A (2021) Diffusion models beat gans on image synthesis. Neural Information Processing Systems 34:8780–8794 Dosovitskiy et al [2021] Dosovitskiy A, Beyer L, Kolesnikov A, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations Ge et al [2022] Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. 
arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. 
In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. 
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Dosovitskiy A, Beyer L, Kolesnikov A, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations Ge et al [2022] Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. 
In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. 
In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. 
IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. 
Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. 
Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. 
In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. 
In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. 
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. 
In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. 
In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. 
In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. 
In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. 
In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. 
Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. 
In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. 
In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. 
In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. 
In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. 
arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. 
In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. 
In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. 
arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. 
In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Chen R, Chen Y, Jiao N, et al (2023b) Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. In: International Conference on Computer Vision Chen et al [2023c] Chen X, Wang Y, Zhang L, et al (2023c) Seine: Short-to-long video diffusion model for generative transition and prediction. arXiv preprint arXiv:231020700 Deng et al [2009] Deng J, Dong W, Socher R, et al (2009) Imagenet: A large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, IEEE, pp 248–255 Devlin et al [2019] Devlin J, Chang MW, Lee K, et al (2019) Bert: Pre-training of deep bidirectional transformers for language understanding. In: North American Chapter of the Association for Computational Linguistics : Human Language Technologies Dhariwal and Nichol [2021] Dhariwal P, Nichol A (2021) Diffusion models beat gans on image synthesis. Neural Information Processing Systems 34:8780–8794 Dosovitskiy et al [2021] Dosovitskiy A, Beyer L, Kolesnikov A, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations Ge et al [2022] Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. 
IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. 
Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Chen X, Wang Y, Zhang L, et al (2023c) Seine: Short-to-long video diffusion model for generative transition and prediction. 
arXiv preprint arXiv:231020700 Deng et al [2009] Deng J, Dong W, Socher R, et al (2009) Imagenet: A large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, IEEE, pp 248–255 Devlin et al [2019] Devlin J, Chang MW, Lee K, et al (2019) Bert: Pre-training of deep bidirectional transformers for language understanding. In: North American Chapter of the Association for Computational Linguistics : Human Language Technologies Dhariwal and Nichol [2021] Dhariwal P, Nichol A (2021) Diffusion models beat gans on image synthesis. Neural Information Processing Systems 34:8780–8794 Dosovitskiy et al [2021] Dosovitskiy A, Beyer L, Kolesnikov A, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations Ge et al [2022] Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. 
In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. 
In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Deng J, Dong W, Socher R, et al (2009) Imagenet: A large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, IEEE, pp 248–255 Devlin et al [2019] Devlin J, Chang MW, Lee K, et al (2019) Bert: Pre-training of deep bidirectional transformers for language understanding. In: North American Chapter of the Association for Computational Linguistics : Human Language Technologies Dhariwal and Nichol [2021] Dhariwal P, Nichol A (2021) Diffusion models beat gans on image synthesis. Neural Information Processing Systems 34:8780–8794 Dosovitskiy et al [2021] Dosovitskiy A, Beyer L, Kolesnikov A, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations Ge et al [2022] Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. 
arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. 
In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. 
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Devlin J, Chang MW, Lee K, et al (2019) Bert: Pre-training of deep bidirectional transformers for language understanding. In: North American Chapter of the Association for Computational Linguistics : Human Language Technologies Dhariwal and Nichol [2021] Dhariwal P, Nichol A (2021) Diffusion models beat gans on image synthesis. Neural Information Processing Systems 34:8780–8794 Dosovitskiy et al [2021] Dosovitskiy A, Beyer L, Kolesnikov A, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations Ge et al [2022] Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. 
IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. 
In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. 
In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. 
In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. 
In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. 
arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. 
In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. 
In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. 
In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. 
In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. 
arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. 
In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. 
In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. 
arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. 
In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
  3. Blattmann A, Rombach R, Ling H, et al (2023b) Align your latents: High-resolution video synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 22563–22575 Chen et al [2023a] Chen J, Yu J, Ge C, et al (2023a) Pixart-α𝛼\alphaitalic_α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:231000426 Chen et al [2020] Chen M, Radford A, Child R, et al (2020) Generative pretraining from pixels. In: International Conference on Machine Learning, PMLR, pp 1691–1703 Chen et al [2023b] Chen R, Chen Y, Jiao N, et al (2023b) Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. In: International Conference on Computer Vision Chen et al [2023c] Chen X, Wang Y, Zhang L, et al (2023c) Seine: Short-to-long video diffusion model for generative transition and prediction. arXiv preprint arXiv:231020700 Deng et al [2009] Deng J, Dong W, Socher R, et al (2009) Imagenet: A large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, IEEE, pp 248–255 Devlin et al [2019] Devlin J, Chang MW, Lee K, et al (2019) Bert: Pre-training of deep bidirectional transformers for language understanding. In: North American Chapter of the Association for Computational Linguistics : Human Language Technologies Dhariwal and Nichol [2021] Dhariwal P, Nichol A (2021) Diffusion models beat gans on image synthesis. Neural Information Processing Systems 34:8780–8794 Dosovitskiy et al [2021] Dosovitskiy A, Beyer L, Kolesnikov A, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations Ge et al [2022] Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. 
arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. 
In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Chen J, Yu J, Ge C, et al (2023a) Pixart-α𝛼\alphaitalic_α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:231000426 Chen et al [2020] Chen M, Radford A, Child R, et al (2020) Generative pretraining from pixels. In: International Conference on Machine Learning, PMLR, pp 1691–1703 Chen et al [2023b] Chen R, Chen Y, Jiao N, et al (2023b) Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. In: International Conference on Computer Vision Chen et al [2023c] Chen X, Wang Y, Zhang L, et al (2023c) Seine: Short-to-long video diffusion model for generative transition and prediction. 
arXiv preprint arXiv:231020700 Deng et al [2009] Deng J, Dong W, Socher R, et al (2009) Imagenet: A large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, IEEE, pp 248–255 Devlin et al [2019] Devlin J, Chang MW, Lee K, et al (2019) Bert: Pre-training of deep bidirectional transformers for language understanding. In: North American Chapter of the Association for Computational Linguistics : Human Language Technologies Dhariwal and Nichol [2021] Dhariwal P, Nichol A (2021) Diffusion models beat gans on image synthesis. Neural Information Processing Systems 34:8780–8794 Dosovitskiy et al [2021] Dosovitskiy A, Beyer L, Kolesnikov A, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations Ge et al [2022] Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. 
In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. 
In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Chen M, Radford A, Child R, et al (2020) Generative pretraining from pixels. In: International Conference on Machine Learning, PMLR, pp 1691–1703 Chen et al [2023b] Chen R, Chen Y, Jiao N, et al (2023b) Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. In: International Conference on Computer Vision Chen et al [2023c] Chen X, Wang Y, Zhang L, et al (2023c) Seine: Short-to-long video diffusion model for generative transition and prediction. arXiv preprint arXiv:231020700 Deng et al [2009] Deng J, Dong W, Socher R, et al (2009) Imagenet: A large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, IEEE, pp 248–255 Devlin et al [2019] Devlin J, Chang MW, Lee K, et al (2019) Bert: Pre-training of deep bidirectional transformers for language understanding. In: North American Chapter of the Association for Computational Linguistics : Human Language Technologies Dhariwal and Nichol [2021] Dhariwal P, Nichol A (2021) Diffusion models beat gans on image synthesis. Neural Information Processing Systems 34:8780–8794 Dosovitskiy et al [2021] Dosovitskiy A, Beyer L, Kolesnikov A, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Dosovitskiy A, Beyer L, Kolesnikov A, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations Ge et al [2022] Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. 
In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. 
In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. 
In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. 
In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. 
In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. 
In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. 
In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. 
arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. 
In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. 
IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. 
Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. 
In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. 
In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. 
arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. 
In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. 
In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. 
In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847
Zhao M, Bao F, Li C, et al (2022) EGSDE: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623
Zhou L, Du Y, Wu J (2021) 3D shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835
Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917
Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. 
In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. 
In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. 
arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. 
In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. 
arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. 
In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. 
Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. 
arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. 
Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. 
Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
Chen et al [2023a] Chen J, Yu J, Ge C, et al (2023a) Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:231000426
Chen et al [2020] Chen M, Radford A, Child R, et al (2020) Generative pretraining from pixels. In: International Conference on Machine Learning, PMLR, pp 1691–1703
Chen et al [2023b] Chen R, Chen Y, Jiao N, et al (2023b) Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. In: International Conference on Computer Vision
Chen et al [2023c] Chen X, Wang Y, Zhang L, et al (2023c) Seine: Short-to-long video diffusion model for generative transition and prediction. arXiv preprint arXiv:231020700
Deng et al [2009] Deng J, Dong W, Socher R, et al (2009) Imagenet: A large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, IEEE, pp 248–255
Devlin et al [2019] Devlin J, Chang MW, Lee K, et al (2019) Bert: Pre-training of deep bidirectional transformers for language understanding. In: North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Dhariwal and Nichol [2021] Dhariwal P, Nichol A (2021) Diffusion models beat gans on image synthesis. Neural Information Processing Systems 34:8780–8794
Dosovitskiy et al [2021] Dosovitskiy A, Beyer L, Kolesnikov A, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations
Ge et al [2022] Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118
Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965
He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778
He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221
Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851
Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems
Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697
Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319
Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455
Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520
Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361
Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311
Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355
Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017
Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273
Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition
Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249
Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465
Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148
Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125
Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations
Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172
Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171
Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222
Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11
Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064
Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205
Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Intelligence
Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710
Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications
Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125
Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695
Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241
Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179
Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510
Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10
Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494
Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839
Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661
Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886
Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32
Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27
Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792
Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636
Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations
Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations
Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11)
Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864
Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations
Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535
Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717
Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30
Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29
Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629
Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273
Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision
Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103
Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989
Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations
Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373
Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157
Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations
Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466
Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847
Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623
Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835
Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917
Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
In: International Conference on Learning Representations Ge et al [2022] Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. 
In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. 
Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Devlin J, Chang MW, Lee K, et al (2019) Bert: Pre-training of deep bidirectional transformers for language understanding. In: North American Chapter of the Association for Computational Linguistics : Human Language Technologies Dhariwal and Nichol [2021] Dhariwal P, Nichol A (2021) Diffusion models beat gans on image synthesis. Neural Information Processing Systems 34:8780–8794 Dosovitskiy et al [2021] Dosovitskiy A, Beyer L, Kolesnikov A, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations Ge et al [2022] Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. 
Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. 
arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Dhariwal P, Nichol A (2021) Diffusion models beat gans on image synthesis. Neural Information Processing Systems 34:8780–8794 Dosovitskiy et al [2021] Dosovitskiy A, Beyer L, Kolesnikov A, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale. 
In: International Conference on Learning Representations Ge et al [2022] Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. 
In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. 
Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Dosovitskiy A, Beyer L, Kolesnikov A, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations Ge et al [2022] Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. 
IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. 
arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. 
In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. 
In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. 
In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. 
In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. 
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. 
In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. 
arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. 
In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. 
In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. 
In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. 
In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. 
In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. 
Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. 
arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. 
Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Chen R, Chen Y, Jiao N, et al (2023b) Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. In: International Conference on Computer Vision Chen et al [2023c] Chen X, Wang Y, Zhang L, et al (2023c) Seine: Short-to-long video diffusion model for generative transition and prediction. arXiv preprint arXiv:231020700 Deng et al [2009] Deng J, Dong W, Socher R, et al (2009) Imagenet: A large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, IEEE, pp 248–255 Devlin et al [2019] Devlin J, Chang MW, Lee K, et al (2019) Bert: Pre-training of deep bidirectional transformers for language understanding. 
In: North American Chapter of the Association for Computational Linguistics : Human Language Technologies Dhariwal and Nichol [2021] Dhariwal P, Nichol A (2021) Diffusion models beat gans on image synthesis. Neural Information Processing Systems 34:8780–8794 Dosovitskiy et al [2021] Dosovitskiy A, Beyer L, Kolesnikov A, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations Ge et al [2022] Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. 
Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. 
In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Chen X, Wang Y, Zhang L, et al (2023c) Seine: Short-to-long video diffusion model for generative transition and prediction. arXiv preprint arXiv:231020700 Deng et al [2009] Deng J, Dong W, Socher R, et al (2009) Imagenet: A large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, IEEE, pp 248–255 Devlin et al [2019] Devlin J, Chang MW, Lee K, et al (2019) Bert: Pre-training of deep bidirectional transformers for language understanding. In: North American Chapter of the Association for Computational Linguistics : Human Language Technologies Dhariwal and Nichol [2021] Dhariwal P, Nichol A (2021) Diffusion models beat gans on image synthesis. Neural Information Processing Systems 34:8780–8794 Dosovitskiy et al [2021] Dosovitskiy A, Beyer L, Kolesnikov A, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations Ge et al [2022] Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. 
Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. 
In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Deng J, Dong W, Socher R, et al (2009) Imagenet: A large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, IEEE, pp 248–255 Devlin et al [2019] Devlin J, Chang MW, Lee K, et al (2019) Bert: Pre-training of deep bidirectional transformers for language understanding. In: North American Chapter of the Association for Computational Linguistics : Human Language Technologies Dhariwal and Nichol [2021] Dhariwal P, Nichol A (2021) Diffusion models beat gans on image synthesis. Neural Information Processing Systems 34:8780–8794 Dosovitskiy et al [2021] Dosovitskiy A, Beyer L, Kolesnikov A, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations Ge et al [2022] Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. 
IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. 
arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Devlin J, Chang MW, Lee K, et al (2019) Bert: Pre-training of deep bidirectional transformers for language understanding. In: North American Chapter of the Association for Computational Linguistics : Human Language Technologies Dhariwal and Nichol [2021] Dhariwal P, Nichol A (2021) Diffusion models beat gans on image synthesis. Neural Information Processing Systems 34:8780–8794 Dosovitskiy et al [2021] Dosovitskiy A, Beyer L, Kolesnikov A, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations Ge et al [2022] Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. 
Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. 
In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. 
In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. 
In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. 
In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. 
In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. 
IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. 
arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. 
IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. 
arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. 
arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. 
Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. 
In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. 
In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. 
In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Mei K, Patel V (2023) Vidm: Video implicit diffusion models. 
In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. 
In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. 
In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. 
In: International Conference on Machine Learning, PMLR, pp 4055–4064
Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205
Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Intelligence
Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710
Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications
Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125
Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695
Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241
Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:1803.09179
Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510
Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10
Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494
Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839
Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661
Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886
Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32
Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27
Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792
Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636
Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations
Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations
Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11)
Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864
Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations
Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535
Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717
Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30
Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29
Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629
Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273
Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision
Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103
Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:2305.03989
Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations
Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373
Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157
Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations
Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466
Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847
Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623
Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835
Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917
Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. 
In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. 
Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Dosovitskiy A, Beyer L, Kolesnikov A, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale. 
In: International Conference on Learning Representations Ge et al [2022] Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. 
In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. 
Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. 
IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. 
arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. 
IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. 
Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. 
In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. 
arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. 
In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. 
In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. 
In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. 
In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. 
In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. 
IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. 
arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. 
In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. 
arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. 
In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. 
In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. 
Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. 
Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. 
In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. 
Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Mei K, Patel V (2023) Vidm: Video implicit diffusion models. 
In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. 
In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. 
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. 
In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
  7. Chen X, Wang Y, Zhang L, et al (2023c) Seine: Short-to-long video diffusion model for generative transition and prediction. arXiv preprint arXiv:231020700 Deng et al [2009] Deng J, Dong W, Socher R, et al (2009) Imagenet: A large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, IEEE, pp 248–255 Devlin et al [2019] Devlin J, Chang MW, Lee K, et al (2019) Bert: Pre-training of deep bidirectional transformers for language understanding. In: North American Chapter of the Association for Computational Linguistics : Human Language Technologies Dhariwal and Nichol [2021] Dhariwal P, Nichol A (2021) Diffusion models beat gans on image synthesis. Neural Information Processing Systems 34:8780–8794 Dosovitskiy et al [2021] Dosovitskiy A, Beyer L, Kolesnikov A, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations Ge et al [2022] Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. 
In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. 
In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Deng J, Dong W, Socher R, et al (2009) Imagenet: A large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, IEEE, pp 248–255 Devlin et al [2019] Devlin J, Chang MW, Lee K, et al (2019) Bert: Pre-training of deep bidirectional transformers for language understanding. In: North American Chapter of the Association for Computational Linguistics : Human Language Technologies Dhariwal and Nichol [2021] Dhariwal P, Nichol A (2021) Diffusion models beat gans on image synthesis. Neural Information Processing Systems 34:8780–8794 Dosovitskiy et al [2021] Dosovitskiy A, Beyer L, Kolesnikov A, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations Ge et al [2022] Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. 
In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. 
arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. 
In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Devlin J, Chang MW, Lee K, et al (2019) Bert: Pre-training of deep bidirectional transformers for language understanding. In: North American Chapter of the Association for Computational Linguistics : Human Language Technologies Dhariwal and Nichol [2021] Dhariwal P, Nichol A (2021) Diffusion models beat gans on image synthesis. Neural Information Processing Systems 34:8780–8794 Dosovitskiy et al [2021] Dosovitskiy A, Beyer L, Kolesnikov A, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations Ge et al [2022] Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. 
IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Dhariwal P, Nichol A (2021) Diffusion models beat gans on image synthesis. Neural Information Processing Systems 34:8780–8794 Dosovitskiy et al [2021] Dosovitskiy A, Beyer L, Kolesnikov A, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations Ge et al [2022] Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. 
IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. 
Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. 
In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. 
In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. 
arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. 
In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. 
In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. 
In: International Conference on Machine Learning, PMLR, pp 8162–8171
Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in FID calculation. arXiv preprint arXiv:2104.11222
Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11
Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image Transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064
Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205
Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) FiLM: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Intelligence
Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710
Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications
Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125
Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695
Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-Net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241
Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) FaceForensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:1803.09179
Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) DreamBooth: Fine-tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510
Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10
Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494
Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839
Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) MoStGAN-V: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661
Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3D neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886
Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32
Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27
Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-A-Video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792
Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) StyleGAN-V: A continuous video generator with the price, image quality and perks of StyleGAN2. In: Computer Vision and Pattern Recognition, pp 3626–3636
Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations
Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations
Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) UCF101: A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11)
Su et al [2021] Su J, Lu Y, Pan S, et al (2021) RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864
Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations
Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) MoCoGAN: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535
Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717
Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30
Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29
Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score Jacobian chaining: Lifting pretrained 2D diffusion models for 3D generation. In: Computer Vision and Pattern Recognition, pp 12619–12629
Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3AN: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273
Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) ImaGINator: Conditional spatio-temporal GAN for video generation. In: Winter Conference on Applications of Computer Vision
Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) LaVie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103
Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) LEO: Generative latent image animator for human video synthesis. arXiv preprint arXiv:2305.03989
Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations
Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373
Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) VideoGPT: Video generation using VQ-VAE and transformers. arXiv preprint arXiv:2104.10157
Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations
Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466
Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847
Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) EGSDE: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623
Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3D shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835
Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917
Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. 
In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. 
In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. 
arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. 
In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. 
In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. 
In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. 
In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. 
In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. 
Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. 
In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. 
In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. 
Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. 
Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. 
In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. 
In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. 
In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. 
In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. 
arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. 
In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. 
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. 
In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. 
In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. 
In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. 
Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. 
In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. 
In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. 
arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. 
arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. 
In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. 
Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. 
Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. 
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Devlin J, Chang MW, Lee K, et al (2019) Bert: Pre-training of deep bidirectional transformers for language understanding. In: North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Dhariwal P, Nichol A (2021) Diffusion models beat gans on image synthesis. Neural Information Processing Systems 34:8780–8794
Dosovitskiy A, Beyer L, Kolesnikov A, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations
Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118
Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965
He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778
He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221
Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851
Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems
Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697
Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319
Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455
Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520
Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:2001.08361
Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:2305.13311
Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355
Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017
Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273
Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition
Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249
Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465
Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148
Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125
Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations
Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172
Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171
Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:2104.11222
Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11
Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064
Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205
Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Intelligence
Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710
Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications
Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125
Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695
Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241
Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:1803.09179
Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510
Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10
Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494
Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839
Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661
Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886
Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32
Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27
Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792
Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636
Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations
Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations
Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11)
Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864
Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations
Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535
Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717
Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30
Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29
Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629
Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273
Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision
Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103
Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:2305.03989
Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations
Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373
Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157
Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations
Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466
Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847
Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623
Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835
Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917
Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. 
In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. 
In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. 
In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. 
In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. 
IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. 
arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. 
In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. 
In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. 
In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. 
Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. 
Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. 
In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. 
Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Mei K, Patel V (2023) Vidm: Video implicit diffusion models. 
In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. 
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
  10. Dhariwal P, Nichol A (2021) Diffusion models beat gans on image synthesis. Neural Information Processing Systems 34:8780–8794 Dosovitskiy et al [2021] Dosovitskiy A, Beyer L, Kolesnikov A, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations Ge et al [2022] Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. 
Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. 
In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. 
In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Dosovitskiy A, Beyer L, Kolesnikov A, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations Ge et al [2022] Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. 
Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. 
arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. 
Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. 
In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. 
In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. 
In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. 
In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. 
In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. 
arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. 
In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. 
In: International Conference on Computer Vision, pp 4195–4205
Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Intelligence
Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710
Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications
Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125
Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695
Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241
Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:1803.09179
Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510
Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10
Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494
Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839
Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661
Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886
Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32
Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27
Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792
Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636
Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations
Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations
Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11)
Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864
Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations
Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535
Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717
Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30
Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29
Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629
Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273
Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision
Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103
Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:2305.03989
Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations
Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373
Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157
Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations
Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466
Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847
Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623
Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835
Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917
Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. 
In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. 
In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. 
Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. 
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. 
Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. 
In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. 
In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. 
Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. 
In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. 
IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. 
arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. 
Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. 
In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. 
In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. 
arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. 
In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. 
arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. 
In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. 
In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. 
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
  12. Ge S, Hayes T, Yang H, et al (2022) Long video generation with time-agnostic vqgan and time-sensitive transformer. In: European Conference on Computer Vision, Springer, pp 102–118 Harvey et al [2022] Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. 
In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. 
Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Harvey W, Naderiparizi S, Masrani V, et al (2022) Flexible diffusion modeling of long videos. Neural Information Processing Systems 35:27953–27965 He et al [2016] He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp 770–778 He et al [2023] He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. 
In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. 
In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. 
In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. 
In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. 
In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. 
Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. 
In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. 
In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. 
In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. 
In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. 
In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. 
In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. 
In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. 
arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. 
arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. 
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. 
Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. 
IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. 
Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. 
In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. 
In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. 
arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. 
In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. 
In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. 
In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. 
In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. 
In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. 
arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. 
In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. 
In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. 
arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. 
In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. 
Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. 
arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. 
Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. 
Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. 
arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. 
In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. 
In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. 
In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. 
In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. 
Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. 
arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. 
arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. 
In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. 
In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. 
In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. 
In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. 
In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. 
In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. 
In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. 
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. 
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
  15. He Y, Yang T, Zhang Y, et al (2023) Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:221113221 Ho et al [2020] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Neural Information Processing Systems 33:6840–6851 Ho et al [2022] Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems Huang et al [2017] Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. 
In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. 
In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. 
In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. 
arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. 
In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. 
In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. 
In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. 
In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. 
In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. 
arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. 
In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. 
In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. 
arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. 
In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. 
In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. 
In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. 
IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. 
arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. 
In: AAAI Conference on Artificial Intelligence, pp 9117–9125
Meng et al [2022] Meng C, He Y, Song Y, et al (2022) SDEdit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations
Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172
Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171
Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in FID calculation. arXiv preprint arXiv:2104.11222
Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11
Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064
Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205
Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) FiLM: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Intelligence
Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710
Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications
Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125
Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695
Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-Net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241
Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) FaceForensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:1803.09179
Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510
Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10
Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494
Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839
Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) MoStGAN-V: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661
Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3D neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886
Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32
Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27
Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-A-Video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792
Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) StyleGAN-V: A continuous video generator with the price, image quality and perks of StyleGAN2. In: Computer Vision and Pattern Recognition, pp 3626–3636
Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations
Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations
Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11)
Su et al [2021] Su J, Lu Y, Pan S, et al (2021) RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864
Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations
Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) MoCoGAN: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535
Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717
Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30
Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29
Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score Jacobian chaining: Lifting pretrained 2D diffusion models for 3D generation. In: Computer Vision and Pattern Recognition, pp 12619–12629
Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3AN: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273
Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) ImaGINator: Conditional spatio-temporal GAN for video generation. In: Winter Conference on Applications of Computer Vision
Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) LaVie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103
Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) LEO: Generative latent image animator for human video synthesis. arXiv preprint arXiv:2305.03989
Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations
Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373
Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) VideoGPT: Video generation using VQ-VAE and transformers. arXiv preprint arXiv:2104.10157
Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations
Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466
Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847
Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) EGSDE: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623
Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3D shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835
Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917
Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. 
Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. 
arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. 
In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. 
arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. 
In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. 
Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917
Ho J, Salimans T, Gritsenko A, et al (2022) Video diffusion models. In: Neural Information Processing Systems
Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697
Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319
Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455
Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520
Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:2001.08361
Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:2305.13311
Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355
Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017
Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273
Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition
Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249
Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465
Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148
Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125
Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations
Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172
Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171
Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:2104.11222
Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11
Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064
Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205
Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Intelligence
Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710
Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications
Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125
Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695
Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241
Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:1803.09179
Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510
Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10
Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494
Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839
Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661
Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886
Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32
Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27
Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792
Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636
Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations
Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations
Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11)
Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864
Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations
Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535
Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717
Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30
Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29
Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629
Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273
Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision
Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103
Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:2305.03989
Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations
Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373
Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157
Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations
Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466
Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847
Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623
Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835
Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917
Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. 
In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. 
In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. 
In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. 
In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. 
In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. 
Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. 
In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. 
In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. 
In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. 
In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. 
arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. 
In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. 
arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. 
In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. 
Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. 
arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
  18. Huang H, He R, Sun Z, et al (2017) Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In: International Conference on Computer Vision, pp 1689–1697 Jia et al [2021] Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. 
In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Jia G, Zheng M, Hu C, et al (2021) Inconsistency-aware wavelet dual-branch network for face forgery detection. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(3):308–319 Jia et al [2022] Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. 
In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. 
In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. 
In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. 
In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. 
arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. 
In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. 
In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. 
In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. 
arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. 
In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. 
Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. 
arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. 
Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Jia G, Huang H, Fu C, et al (2022) Rethinking image cropping: Exploring diverse compositions from global views. In: Computer Vision and Pattern Recognition, pp 2446–2455 Kahembwe and Ramamoorthy [2020] Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. 
Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Kahembwe E, Ramamoorthy S (2020) Lower dimensional kernels for video discriminators. 
Neural Networks 132:506–520 Kaplan et al [2020] Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. 
arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. 
arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. 
In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. 
In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. 
In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. 
Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. 
In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. 
In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. 
In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. 
In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. 
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. 
Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. 
Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. 
In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. 
In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. 
arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. 
In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. 
arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. 
In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. 
In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. 
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. 
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. 
In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. 
In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. 
Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. 
In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. 
In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. 
arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. 
In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. 
arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. 
In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. 
In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. 
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. 
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. 
In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. 
Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
  22. Kaplan J, McCandlish S, Henighan T, et al (2020) Scaling laws for neural language models. arXiv preprint arXiv:200108361 Lu et al [2023] Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. 
arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Lu H, Yang G, Fei N, et al (2023) Vdt: General-purpose video diffusion transformers via mask modeling. arXiv preprint arXiv:230513311 Luo et al [2021a] Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. 
IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. 
In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. 
In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. 
Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. 
In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. 
In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. 
In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. 
Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. 
In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. 
In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. 
In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. 
In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. 
arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. 
In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. 
Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. 
arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo M, Cao J, Ma X, et al (2021a) Fa-gan: Face augmentation gan for deformation-invariant face recognition. IEEE Transactions on Information Forensics and Security 16:2341–2355 Luo et al [2021b] Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. 
In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. 
In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo M, Ma X, Li Z, et al (2021b) Partial nir-vis heterogeneous face recognition with automatic saliency search. IEEE Transactions on Information Forensics and Security 16:5003–5017 Luo et al [2022] Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. 
In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. 
Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. 
In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. 
In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. 
In: International Conference on Machine Learning, PMLR, pp 8162–8171
Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in FID calculation. arXiv preprint arXiv:2104.11222
Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11
Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064
Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205
Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) FiLM: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Intelligence
Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710
Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications
Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125
Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695
Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-Net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241
Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) FaceForensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:1803.09179
Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510
Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10
Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494
Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839
Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) MoStGAN-V: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661
Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3D neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886
Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32
Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27
Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-A-Video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792
Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) StyleGAN-V: A continuous video generator with the price, image quality and perks of StyleGAN2. In: Computer Vision and Pattern Recognition, pp 3626–3636
Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations
Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations
Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11)
Su et al [2021] Su J, Lu Y, Pan S, et al (2021) RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864
Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations
Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) MoCoGAN: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535
Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717
Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30
Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29
Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score Jacobian chaining: Lifting pretrained 2D diffusion models for 3D generation. In: Computer Vision and Pattern Recognition, pp 12619–12629
Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3AN: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273
Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) ImaGINator: Conditional spatio-temporal GAN for video generation. In: Winter Conference on Applications of Computer Vision
Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) LaVie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103
Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) LEO: Generative latent image animator for human video synthesis. arXiv preprint arXiv:2305.03989
Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations
Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373
Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) VideoGPT: Video generation using VQ-VAE and transformers. arXiv preprint arXiv:2104.10157
Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations
Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466
Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847
Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) EGSDE: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623
Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3D shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835
Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917
Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. 
Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. 
In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. 
In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. 
In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. 
In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. 
Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. 
In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. 
arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. 
In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. 
In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. 
In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. 
In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. 
In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. 
arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. 
arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. 
In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. 
In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. 
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. 
arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. 
In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. 
arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. 
In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. 
In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
  26. Luo M, Ma X, Huang H, et al (2022) Style-based attentive network for real-world face hallucination. In: Pattern Recognition and Computer Vision, Springer, pp 262–273 Luo et al [2023] Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. 
In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. 
In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. 
arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. 
Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. 
In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. 
In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. 
In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Mei K, Patel V (2023) Vidm: Video implicit diffusion models. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. 
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
  27. Luo Z, Chen D, Zhang Y, et al (2023) Videofusion: Decomposed diffusion models for high-quality video generation. In: Computer Vision and Pattern Recognition Ma et al [2021] Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. 
In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. 
arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. 
In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. 
In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. 
In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. 
In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. 
In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
  28. Ma X, Zhou X, Huang H, et al (2021) Free-form image inpainting via contrastive attention network. In: International Conference on Pattern Recognition, IEEE, pp 9242–9249 Ma et al [2022] Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ma X, Zhou X, Huang H, et al (2022) Contrastive attention network with dense field estimation for face completion. Pattern Recognition 124:108465 Ma et al [2023] Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. 
In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. 
arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. 
In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. 
In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. 
arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. 
In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. 
In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. 
In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. 
In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. 
In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. 
arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. 
In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. 
Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. 
Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
  30. Ma X, Zhou X, Huang H, et al (2023) Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications p 121148 Mei and Patel [2023] Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. 
In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. 
In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. 
arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. 
arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. 
In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. 
In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. 
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. 
arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. 
In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
  31. Mei K, Patel V (2023) Vidm: Video implicit diffusion models. In: AAAI Conference on Artificial Intelligence, pp 9117–9125 Meng et al [2022] Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. 
In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. 
In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. 
In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. 
In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. 
Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
  32. Meng C, He Y, Song Y, et al (2022) Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations Neimark et al [2021] Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. 
In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Neimark D, Bar O, Zohar M, et al (2021) Video transformer network. In: International Conference on Computer Vision, pp 3163–3172 Nichol and Dhariwal [2021] Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. 
In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. 
Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. 
arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. 
In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. 
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. 
Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. 
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
  34. Nichol AQ, Dhariwal P (2021) Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, PMLR, pp 8162–8171 Parmar et al [2021] Parmar G, Zhang R, Zhu JY (2021) On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:210411222 5:14 Parmar et al [2023] Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. 
In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. 
In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. 
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. 
In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. 
In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. 
In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar G, Kumar Singh K, Zhang R, et al (2023) Zero-shot image-to-image translation. In: ACM SIGGRAPH Conference, pp 1–11 Parmar et al [2018] Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. 
In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. 
In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. 
In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. 
arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. 
arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. 
arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. 
In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. 
Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. 
arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. 
Center for Research in Computer Vision 2(11)
  37. Parmar N, Vaswani A, Uszkoreit J, et al (2018) Image transformer. In: International Conference on Machine Learning, PMLR, pp 4055–4064 Peebles and Xie [2023] Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. 
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205 Perez et al [2018] Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. 
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. 
In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. 
Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. 
Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
Peebles W, Xie S (2023) Scalable diffusion models with transformers. In: International Conference on Computer Vision, pp 4195–4205
Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Intelligence
Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710
Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications
Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125
Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695
Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241
Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:1803.09179
Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510
Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10
Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494
Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839
Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661
Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886
Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32
Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27
Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792
Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636
Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations
Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations
Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11)
Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864
Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations
Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535
Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717
Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30
Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29
Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629
Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273
Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision
Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103
Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:2305.03989
Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations
Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373
Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157
Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations
Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466
Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847
Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623
Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835
Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917
Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. 
arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. 
In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. 
In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. 
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. 
In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
  39. Perez E, Strub F, De Vries H, et al (2018) Film: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Antelligence Pota et al [2020] Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. 
Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Pota M, Esposito M, De Pietro G, et al (2020) Best practices of convolutional neural networks for question classification. Applied Sciences 10(14):4710 Rakhimov et al [2021] Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. 
Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. 
arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
  41. Rakhimov R, Volkhonskiy D, Artemov A, et al (2021) Latent video transformer. In: Computer Vision, Imaging and Computer Graphics Theory and Applications Ramesh et al [2022] Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. 
arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. 
Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
  42. Ramesh A, Dhariwal P, Nichol A, et al (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:220406125 1(2):3 Rombach et al [2022] Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Rombach R, Blattmann A, Lorenz D, et al (2022) High-resolution image synthesis with latent diffusion models. In: Computer Vision and Pattern Recognition, pp 10684–10695 Ronneberger et al [2015] Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. 
In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention, Springer, pp 234–241 Rössler et al [2018] Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. 
Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. 
Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. 
In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. 
arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. 
In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. 
Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. 
Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
  45. Rössler A, Cozzolino D, Verdoliva L, et al (2018) Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:180309179 Ruiz et al [2023] Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. 
arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. 
In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
  46. Ruiz N, Li Y, Jampani V, et al (2023) Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Computer Vision and Pattern Recognition, pp 22500–22510 Saharia et al [2022a] Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. 
Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. 
In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
  47. Saharia C, Chan W, Chang H, et al (2022a) Palette: Image-to-image diffusion models. In: ACM SIGGRAPH Conference, pp 1–10 Saharia et al [2022b] Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. 
In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. 
arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
  48. Saharia C, Chan W, Saxena S, et al (2022b) Photorealistic text-to-image diffusion models with deep language understanding. Neural Information Processing Systems 35:36479–36494 Saito et al [2017] Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Saito M, Matsumoto E, Saito S (2017) Temporal generative adversarial nets with singular value clipping. In: International Conference on Computer Vision, pp 2830–2839 Shen et al [2023] Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. 
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shen X, Li X, Elhoseiny M (2023) Mostgan-v: Video generation with temporal motion styles. In: Computer Vision and Pattern Recognition, pp 5652–5661 Shue et al [2023] Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. 
arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. 
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. 
In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. 
In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
  51. Shue JR, Chan ER, Po R, et al (2023) 3d neural field generation using triplane diffusion. In: Computer Vision and Pattern Recognition, pp 20875–20886 Siarohin et al [2019] Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. 
In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Siarohin A, Lathuilière S, Tulyakov S, et al (2019) First order motion model for image animation. Neural Information Processing Systems 32 Simonyan and Zisserman [2014] Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. 
Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27 Singer et al [2022] Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. 
In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. 
In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. 
In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. 
Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. 
In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
  53. Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Neural Information Processing Systems 27
  54. Singer U, Polyak A, Hayes T, et al (2022) Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:220914792 Skorokhodov et al [2022] Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Skorokhodov I, Tulyakov S, Elhoseiny M (2022) Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In: Computer Vision and Pattern Recognition, pp 3626–3636 Song et al [2021a] Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song J, Meng C, Ermon S (2021a) Denoising diffusion implicit models. In: International Conference on Learning Representations Song et al [2021b] Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. 
arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. 
In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. 
arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. 
arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. 
In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. 
In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. 
arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Song Y, Sohl-Dickstein J, Kingma DP, et al (2021b) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations Soomro et al [2012] Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Soomro K, Zamir AR, Shah M (2012) A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision 2(11) Su et al [2021] Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. 
In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. 
In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. 
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. 
In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. 
In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
  59. Su J, Lu Y, Pan S, et al (2021) Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:210409864 Tian et al [2021] Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations Tulyakov et al [2018] Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535 Unterthiner et al [2018] Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 10157–10166 Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:181201717 Vaswani et al [2017] Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30 Vondrick et al [2016] Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. 
In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29 Wang et al [2023a] Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. 
arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629 Wang et al [2020a] Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. 
In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273 Wang et al [2020b] Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision Wang et al [2023b] Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:230915103 Wang et al [2023c] Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. 
In: Winter Conference on Applications of Computer Vision
  60. Tian Y, Ren J, Chai M, et al (2021) A good image generator is what you need for high-resolution video synthesis. In: International Conference on Learning Representations
  61. Tulyakov S, Liu MY, Yang X, et al (2018) Mocogan: Decomposing motion and content for video generation. In: Computer Vision and Pattern Recognition, pp 1526–1535
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:230503989 Weissenborn et al [2020] Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations Xiong et al [2018] Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. 
In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. 
Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
62. Unterthiner T, Van Steenkiste S, Kurach K, et al (2018) Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717
63. Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Neural Information Processing Systems 30
64. Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29
65. Wang H, Du X, Li J, et al (2023a) Score Jacobian chaining: Lifting pretrained 2D diffusion models for 3D generation. In: Computer Vision and Pattern Recognition, pp 12619–12629
66. Wang Y, Bilinski P, Bremond F, et al (2020a) G3AN: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273
67. Wang Y, Bilinski P, Bremond F, et al (2020b) ImaGINator: Conditional spatio-temporal GAN for video generation. In: Winter Conference on Applications of Computer Vision
68. Wang Y, Chen X, Ma X, et al (2023b) LaVie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103
69. Wang Y, Ma X, Chen X, et al (2023c) LEO: Generative latent image animator for human video synthesis. arXiv preprint arXiv:2305.03989
70. Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations
71. Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373
72. Yan W, Zhang Y, Abbeel P, et al (2021) VideoGPT: Video generation using VQ-VAE and transformers. arXiv preprint arXiv:2104.10157
73. Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations
74. Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466
75. Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847
76. Zhao M, Bao F, Li C, et al (2022) EGSDE: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623
77. Zhou L, Du Y, Wu J (2021) 3D shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835
78. Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917
79. Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373 Yan et al [2021] Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:210410157 Yu et al [2022] Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. 
In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations Yu et al [2023] Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466 Zhang et al [2023] Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847 Zhao et al [2022] Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. 
In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623 Zhou et al [2021] Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835 Zhou et al [2022] Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917 Zhou et al [2023] Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166 Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
  64. Vondrick C, Pirsiavash H, Torralba A (2016) Generating videos with scene dynamics. Neural Information Processing Systems 29
  65. Wang H, Du X, Li J, et al (2023a) Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In: Computer Vision and Pattern Recognition, pp 12619–12629
  66. Wang Y, Bilinski P, Bremond F, et al (2020a) G3an: Disentangling appearance and motion for video generation. In: Computer Vision and Pattern Recognition, pp 5264–5273
  67. Wang Y, Bilinski P, Bremond F, et al (2020b) Imaginator: Conditional spatio-temporal gan for video generation. In: Winter Conference on Applications of Computer Vision
  68. Wang Y, Chen X, Ma X, et al (2023b) Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103
  69. Wang Y, Ma X, Chen X, et al (2023c) Leo: Generative latent image animator for human video synthesis. arXiv preprint arXiv:2305.03989
  70. Weissenborn D, Täckström O, Uszkoreit J (2020) Scaling autoregressive video models. In: International Conference on Learning Representations
  71. Xiong W, Luo W, Ma L, et al (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Computer Vision and Pattern Recognition, pp 2364–2373
  72. Yan W, Zhang Y, Abbeel P, et al (2021) Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157
  73. Yu S, Tack J, Mo S, et al (2022) Generating videos with dynamics-aware implicit generative adversarial networks. In: International Conference on Learning Representations
  74. Yu S, Sohn K, Kim S, et al (2023) Video probabilistic diffusion models in projected latent space. In: Computer Vision and Pattern Recognition, pp 18456–18466
  75. Zhang L, Rao A, Agrawala M (2023) Adding conditional control to text-to-image diffusion models. In: International Conference on Computer Vision, pp 3836–3847
  76. Zhao M, Bao F, Li C, et al (2022) Egsde: Unpaired image-to-image translation via energy-guided stochastic differential equations. Neural Information Processing Systems 35:3609–3623
  77. Zhou L, Du Y, Wu J (2021) 3d shape generation and completion through point-voxel diffusion. In: International Conference on Computer Vision, pp 5826–5835
  78. Zhou Y, Zhang R, Chen C, et al (2022) Towards language-free training for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 17907–17917
  79. Zhou Y, Liu B, Zhu Y, et al (2023) Shifted diffusion for text-to-image generation. In: Computer Vision and Pattern Recognition, pp 10157–10166
Authors (8)
  1. Xin Ma (105 papers)
  2. Yaohui Wang (50 papers)
  3. Gengyun Jia (5 papers)
  4. Xinyuan Chen (48 papers)
  5. Ziwei Liu (368 papers)
  6. Yuan-Fang Li (90 papers)
  7. Cunjian Chen (21 papers)
  8. Yu Qiao (563 papers)
Citations (136)